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Virginia Tech, Yale University and University of Rochester 

We study the Cox models with semiparametric relative risk, which 
can be partially linear with one nonparametric component, or multi- 
ple additive or nonadditive nonparametric components. A penalized 
partial likelihood procedure is proposed to simultaneously estimate 
the parameters and select variables for both the parametric and the 
nonparametric parts. Two penalties are applied sequentially. The first 
penalty, governing the smoothness of the multivariate nonlinear co- 
variate effect function, provides a smoothing spline ANOVA frame- 
work that is exploited to derive an empirical model selection tool for 
the nonparametric part. The second penalty, either the smoothly- 
clipped-absolute-deviation (SCAD) penalty or the adaptive LASSO 
penalty, achieves variable selection in the parametric part. We show 
that the resulting estimator of the parametric part possesses the 
oracle property, and that the estimator of the nonparametric part 
achieves the optimal rate of convergence. The proposed procedures 
are shown to work well in simulation experiments, and then applied 
to a real data example on sexually transmitted diseases. 

1. Introduction. In survival analysis, a problem of interest is to identify 
relevant risk factors and evaluate their contributions to survival time. Cox 
proportional hazards (PH) model is a popular approach to study the influ- 
ence of covariates on survival outcome. Conventional PH models assume that 
covariates have a log- linear effect on the hazard function. These PH mod- 
els have been studied by numerous authors; see, for example, the references 
in [15]. The log-linear assumption can be too rigid in practice, especially 
when continuous covariates are present. This limitation motivates PH mod- 
els with nonparametric relative risk. Some examples are [6, 11, 12, 23, 35]. 



Received August 2009; revised December 2009. 

Supported in part by NIH/NIAID Grant AI59773 and NSF Grant DMS-08-06097. 
AMS 2000 subject classifications. Primary 62N01, 62N03; secondary 62N02. 
Key words and phrases. Backfitting, partially linear models, penalized variable selec- 
tion, proportional hazards, penalized partial likelihood, smoothing spline ANOVA. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Statistics, 
2010, Vol. 38, No. 4, 2092-2117. This reprint differs from the original in 
pagination and typographic detail. 



1 



2 



P. DU, S. MA AND H. LIANG 



However, nonparametric models may suffer from the curse of dimensional- 
ity. They also lack the easy interpretation in parametric risk models. PH 
models with semiparametric relative risk strike a good balance by allowing 
nonparametric risk for some covariates and parametric risk for others. The 
benefits of such models are two-folds. First, they have the merits of mod- 
els with parametric risk, including easy interpretation, easy estimation and 
easy inference. Second, their nonparametric part allows a flexible form for 
some continuous covariates whose patterns are unexplored and whose con- 
tribution cannot be assessed by simple parametric models. For example, [10] 
proposed efficient estimation for a partially linear Cox model with additive 
nonlinear covariate effects. Reference [3] studied partially linear hazard re- 
gression for multivariate survival data with time-dependent covariates via 
a profile pseudo-partial likelihood approach, where the only nonlinear co- 
variate effect was estimated by local polynomials. But these models are 
limited to one nonparametric component or additive nonparametric com- 
ponents, ignoring the possible interactions between different nonparametric 
components. Reference [30] proposed a partially linear additive hazard model 
whose nonlinear varying coefficients represent the interaction between the 
time-dependent nonlinear covariate and other covariates. 

Variable selection in survival data has drawn much attention in the past 
decade. Traditional procedures such as Akaike information criterion (AIC) 
and Bayesian information criterion (BIC), as noted by [2], suffer from the 
lack of stability and lack of incorporating stochastic errors inherited in the 
stage of variable selection. References [26] and [32] extended, respectively, 
the LASSO and the adaptive LASSO variable selection procedures to the 
Cox model. Reference [8] extended the nonconcave penalized likelihood ap- 
proach [7] to the Cox PH models. Reference [4] studied variable selection 
for multivariate survival data. The Cox models considered in these three pa- 
pers all assumed a linear form of covariate effects in the relative risk. More 
recently, [13] and [14] proposed procedures for selecting variables in semi- 
parametric linear regression models for censored data, where the dependence 
of response over covariates was also assumed to be of linear form. Hence, 
the aforementioned variable selection procedures are limited in their rigid 
assumption of parametric covariate effects which may not be realistic in prac- 
tice. We will fill in these gaps in three aspects: (i) our models are flexible 
with semiparametric relative risk, which allows nonadditive nonparamet- 
ric components, without limiting to single or additive nonlinear covariate 
effects; and (ii) our approach can simultaneously estimate the parametric 
coefficient vector and select contributing parametric components; and (iii) 
our approach also provides a model selection tool for the nonparametric 
components. 

Let the hazard function for a subject be 

(1.1) h(t) = h (t)exp[(3 T U + 7 ] (W)}, 
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where ho is the unknown baseline hazard, Z T = (U T , W T ) is the covariate 
vector, (3 is the unknown coefficient vector, and rj(w) = r](wi, . . . ,w q ) is an 
unknown multivariate smooth function. We propose a doubly penalized pro- 
file partial likelihood approach for estimation, following the general profile 
likelihood framework set up by [20]. Given (3, rj is estimated by smoothing 
splines through the minimization of a penalized log partial likelihood. Then 
the smoothing spline ANOVA decomposition not only allows the natural 
inclusion of interaction effects but also provides the basis for deriving an 
empirical model selection tool. After substituting the estimate of rj, we ob- 
tain a profile partial likelihood, which is then penalized to get an estimate 
of /3. To achieve variable selection in (3, we use the smoothly clipped ab- 
solute deviation (SCAD) penalty. We show that our estimate of rj achieves 
the optimal convergence rate, and our estimate of (3 possesses the oracle 
property such that the true zero coefficients are automatically estimated as 
zeros and the remaining coefficients are estimated as well as if the correct 
submodel were known in advance. Our numerical studies reveal that the 
proposed method is promising in both estimation and variable selection. We 
then apply it to a study on sexually transmitted diseases with 877 subjects. 

The rest of the article is organized as follows. Section 2 gives the details 
of the proposed method, in the order of model description and estimation 
procedure (Section 2.1), model selection in the nonparametric part (Section 
2.2), asymptotic properties (Section 2.3), and miscellaneous issues (Section 
2.4) like standard error estimates and smoothing parameter selection. Sec- 
tion 3 presents the empirical studies, and Section 4 gives an application 
study. Remarks in Section 5 conclude the article. 



2. Method. Let T be the failure time and C be the right-censoring time. 
Assume that T and C are conditionally independent given the covariate. The 
observable random variable is (X, A, Z), where X = min(T, C), A = I\rr<c\t 
and Z = (U,W) is the covariate vector with U G K d and W £ R q . With n 
i.i.d. (Xi, Aj, Zi),i = 1, . . . , n, we assume a Cox model for the hazard function 
as in (1.1). 

2.1. Estimation and variable selection for parametric parts. Let Yi{t) = 
I[Xi>t] ■ We propose to estimate (/3, if) through a penalized profile partial like- 
lihood approach. Given f3, r] is estimated as the minimizer of the penalized 
partial likelihood 



l p (r,) = --J2Ai\ Uff3 + V (Wi) - log Yk(X t ) exp[[/J/3 + V (W k )} 

(2.1) 



n 

i=l (. k=l 
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where the summation is the negative log partial likelihood representing the 
goodness-of-fit, J(rj) is a roughness penalty specifying the smoothness of 
77, and A > is a smoothing parameter controlling the tradeoff. A popular 
choice for J is the L2-penalty which yields tensor product cubic splines (see, 
e.g., [9]) for multivariate W. Note that r\ in (2.1) is identifiable up to a 
constant, so we use the constraint Jrj = 0. 

Once an estimate r) of 77 is obtained, the estimator of (3 is then the max- 
imizer of the penalized profile partial likelihood 

Uff3 + fj(Wi) - log Y k {Xi) exp[lf /3 + fj(W k )} 1 

k=l J 

-nj>,(l&|), 
3=1 

where Pe 3 {\ - | ) is the SCAD penalty on f3 [7]. 

The detailed algorithm for our estimation procedure is as follows. 

Step 1. Find a proper initial estimate We note that, as long as the 
initial estimate is reasonable, convergence to the true optimizer can 
be achieved. Difference choices of the initial estimate will affect the 
number of iterations needed but not the convergence itself. 

Step 2. Let /3( k ~^ be the estimate of f3 before the fcth iteration. Plug /3( fe_1 ) 
into (2.1) and solve for 77 by minimizing the penalized partial likeli- 
hood lp(k-i)(v)- Let r)( fc ) be the estimate thus obtained. 

Step 3. Plug if k ' into (2.2) and solve for f3 by maximizing the penalized 
profile partial likelihood ^(*o(/3). Let /3^ fc - > be the estimate thus ob- 
tained. 

Step 4. Replace ^ k ~^ in step 2 by (3^ and repeat steps 2 and 3 until 
convergence to obtain the final estimates f3 and 77. 

Our experience shows that the algorithm usually converges quickly within 
a few iterations. As in the classical Cox proportional hazards model, the 
estimation of baseline hazard function is of less interest and not required in 
our estimation procedure. 

In step 3, we use a one-step approximation to the SCAD penalty [34]. 
It transforms the SCAD penalty problem to a LASSO-type optimization, 
where the celebrated LARS algorithm proposed in [5] can be used. Let lf)((3) 
be the profile log partial likelihood in step 3, and I((3) = — V 2 Z^(/3) be the 
Hessian matrix, where the derivative is with respect to f3 treating f) as fixed. 
Compute the Cholesky decomposition of /(/fl^ " 1 ^) such that /(Z^ -1 )) = 

V T V. Let A = {j :p' e .(\pf = 0} and B = {j :^.(|4j fc_1) |) > 0}. Decom- 

pose V and the new estimate (3^ k > accordingly such that V = [Va,Vb] and 
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(Step 3a) Let y = V^ k 1 \ Then for each j € B, replace the jth column of 

V by setting Vj = Vj J k n . 

3 V 9j .(l/?j ^1) 

(Step 3b) Let Ha = VAiV^VA) -1 ^^^ be the projection matrix to the column 

space of Va- Compute y* = y — HAy and Vg = Vb — HaVb- 
(Step 3c) Apply the LARS algorithm to solve 



^ = argmin{i||y*-^/3|| 2 +nV%|/3 j |). 

* 1 J6B J 



(Step 3d) Compute = (VjVk) _1 Vj(y-Vb/3^) to obtain /3* = (&*/,& B 
(Step 3e) For j G A, set $ fc) = /3*. For j E S, set $ fc) = 4? 



2.2. Model selection for nonparametric component. While the SCAD 
penalty takes care of variable selection for the parametric components, we 
still need an approach to assess the structure of the nonparametric compo- 
nents. In this section, we will first transform the profile partial likelihood 
problem in (2.1) to a density estimation problem with biased sampling, and 
then derive a model selection tool based on the Kullback-Leibler geometry. 
In this part, we treat (3 as fixed, taking the value from the previous step in 
the algorithm. 

Let (ii, . . . ,?j\r) be the indices for the failed subjects. Then the profile 
partial likelihood in (2.1) for estimating r\ is 



P =i 



Consider the empirical measure on the discrete domain W n = {W±, . . . , 
W n } such that JfdP% = £ £" =1 f( W i)- Then e V / e??(ff n defines a den- 
sity function on W n . Let ai(-), . . . , ajv(") be weight functions defined on the 
discrete domain W n such that a p (Wk) = Y).(Xi p )e u k & , p = 1, . . . ,N. Alter- 
natively, one can think of CLp S clS vectors of weights with length n. Then each 
term in the profile partial likelihood, with the constant n ignored, becomes 
a p{^i P ) er! ^ Wip I f ctp(w) x e^^dP™. Thus, this resembles a density estima- 
tion problem with bias introduced by the known weight function a p (-). 

For two density estimates 771 and 7/2 in the above pseudo biased sampling 
density estimation problem, define their Kullback-Leibler distance as 



(2.3) -log / aJw)e^dP™ 
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Let 770 be the true function. Suppose the estimation of rjo has been done in 
a space %\, but in fact i]o G H2 C %\. Let 57 be the estimate of 770 in Hi. 
Let 77 be the Kullback-Leibler projection of f) in H2, that is, the minimizer 
of Kh(f],r]) for 77 G an d 7/ c be the estimate from the constant model. Set 
77 = 77 + a(fj — r/ c ) for a real. Differentiating KL(?7, 77) with respect to a and 
evaluating at a = 0, one has 

which, through straightforward calculation, yields 

KL(r/, t? c ) = KL(??, fj) + KL(f/, 77J . 

Hence, the ratio KL(f), , fj)/KL(r],r] c ) can be used to diagnose the feasibility 
of a reduced model 77 G %2 : the smaller the ratio is, the more feasible the 
reduced model is. 

2.3. Asymptotic results. Denote by 7i m (W) the Sobolev space of func- 
tions on W whose mth order partial derivatives are square integrable. Let 



U = |r/G^ m (W), J rj(w)dw = o\, 



and f)* be the estimate of 770 in H that minimizes the penalized partial 
likelihood 



Ufp + r,(Wi) - log Y t(t) ex P [[/J/3 + r,(W t )} \ dN,(t) 



,2.4) 

Note that % is an infinite-dimensional function space. Hence, in practice, 
the minimization of (2.4) is usually performed in a data-adaptive finite- 
dimensional space 

H n =Nj@ span{Rj(W il , •) : I = 1, . . . , q n }, 

where Mj = {?? G H, J(rj) = 0} is the null space of J, and Rj is the repro- 
ducing kernel (see, e.g., [28]) in its complement space Hj = HQ Mj, and 
{Wi ± , . . . , Wi qn } is a random subset of { W{ : i = 1, . . . , n}. When q n = n, one 
selects all the Wi,i = 1, . . . , n as the knots. This is the number of knots used 
in conventional smoothing splines. However, under the regression setting, 
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[16] showed that a q n of the order n 2 ^ r+l ^ +e ,Ve > is sufficient to yield 
an estimate with the optimal convergence rate. Here r is a constant associ- 
ated with the Sobolev space T~L, for example, r = 2m for splines of order m 
(one-dimension w) and r = 2m — S, V<5 > for tensor product splines (multi- 
dimension w). We shall show that such an order for q n also works for the r/ 
estimation in our partially linear Cox model. 

Let s n [f;(3, V ](t) = ^ n k=1 Y k (t)f(U k ,W k )e W (Ul(3 + v(W k )) and s n [(3, 
V ](t) = s n (l;(3,r ] ){t). Define 

s[f; j3, V ](t) = E[Y(t)f(U, W) exp(U T f3 + n(W))] 

f(u, w)e uT ^ w) q{t, u, w) du dw. 



For any functions / and g, define 
V(f,g)-- 



(2.5) 



T 



s[fg;(3, Vo ](t) s[f;(3, Vo ](t) s[g; (3, Vo ](t) \ 
s[(3, m ](t) s[P, VQ }(t) s[p, m ](t) ] 

x s [(3, Vo }(t)dA {t). 



Write V(f) = V(f,f). Let fj be the estimate that minimizes (2.4) in H n . 
Then we have the following theorem. 

Theorem 2.1. Under conditions A1-A7 in the Appendix, 
(V + XJ)(fi*-rj ) = O p (n- r ^ r+l ^) and (V + XJ)(fj- m ) = 0^1^). 

This is the optimal convergence rate for estimate of a nonparametric 
function. In the view of Lemma A.l, this theorem also indicates the same 
convergence rate in terms of the L2-norm. Also note that although a higher 
order of q n such as 0(n) would yield the same convergence rate for fj, it will 
make the function space H n too big to apply an entropy bound result that 
is critical in the proof of Theorem 2.2. 

Let C P {P) =Z p (/3) -nEjtiP0j(l/%l): where 

n „ 

W) = Y. J {UfP + 'h{W i )-\ogs n [l3,fi\{t)}dN i {t). 

i=l 

Let /3 = (/3i0; • ■ • , /5do) T — (Pio, 02o) T be the true coefficient vector. Without 
loss of generality, assume that f3 2 Q = 0. Let s be the number of nonzero com- 
ponents in /3 . Define a n = maxj{|p^ (\/3jo\)\ : Pjo °}> K = ma.yLj{\p'g. (\Pjo\)\ ■ 
(3 j0 / 0}, and 

b = (Pflj (£10) sgn(/3i ), . . . ,p' 9g (/3 s o) sgn(/3 s0 )) T , 
S = diag^OAoD/IAol, . . . ,p' es (\(3 s0 \)/\f3 s0 \). 
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Define tt\ : — > R s such that tti(u, w) =U\, where u\ is the vector of the 
first s components of u. Let Vo(vri) be defined like V(tti) in (2.5) but with 
(3 replaced by (3 . 

Theorem 2.2. Under conditions A1-A7 in the Appendix, if a n = 0(n~ 1 / 2 ), 
b n = o(l) and q n = o{n 1 / 2 ), then: 

(i) There exists a local maximizer (3 of Cp{(3) such that ||/3 — /3 || = 

(ii) Further assume that for all l<j<d, 9j = o(l), 9j l = o(n 1 / 2 ), and 

liminf liminf 9~ 1 p'g.(u) > 0. 

With probability approaching one, the root-n consistent estimator (3 in (i) 
must satisfy (3 2 = and 

MVoM + - P 10 + (Vofa) + S ) _1 b} -)• JV(0, Fo(tti)). 

2.4. Miscellaneous issues. In this section, we will propose the standard 
error estimates for both the parametric and the nonparametric components, 
and discuss the selection of the smoothing parameters 9 and A. 

2.4.1. Standard error estimates. Let l p ((3) be the profile log partial like- 
lihood in the last iteration of step 3 and 

S e (/3)=diag{p / ej (|/3 1 |)/|/3 1 |,...,^(|/3 p |)/|/3 p |}. 

Then the standard errors for the nonzero coefficients of are given by the 
sandwich formula 

cov(3) = - nY, e (p)y l C w{Vl P Cp)}{V 2 l p (P) - nZgiP)}- 1 . 

Sometimes the standard errors for zero coefficients are also of interest. A 
discussion of this problem is in Section 5. 

In (2.1), r] can be decomposed as r/ = + r/^ where 77 ^ lies in the null 
space of the penalty J representing the lower order part and 77M lies in 
the complement space representing the higher order part. A Bayes model 
interprets (2.1) as a posterior likelihood when 77M is assigned an improper 
constant prior and 77W is assigned a Gaussian prior with zero mean and cer- 
tain covariance matrix. The minimizer 17 of (2.1) then becomes the posterior 
mode. When the minimization is carried out in a data-adaptive function 
space % n with basis functions ip = (ipi, . . . ,ipq n ) T , we can write 77 = ip T c. 
Then a quadratic approximation to (2.1) yields an approximate posterior 
covariance matrix for c, which can be used to construct point-wise confi- 
dence intervals for r/. 
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2.4.2. Smoothing parameter selection. As shown in [33], the effective de- 
grees of freedom for li-penalty model is well approximated by the number 
of nonzero coefficients. Note that our SCAD procedure is implemented by a 
LASSO approximation at each step. Hence, if we let A be the set of nonzero 
coefficients, the AIC score for selecting 9 in step 3 is 

AIC = 2Z P 03) + 2L4|, 

where |^4| is the cardinality of A. 

As illustrated in Section 2.2, the estimation of r\ in step 2 can be cast 
as a density estimation problem with biased sampling. Let KL(?7o>?7a) be 
the Kullback-Leibler distance, as defined in (2.3), between the true "den- 
sity" e w / / dP™ and the estimate / / dP%. An optimal A should 
minimize KL(riQ,fj\) or the relative Kullback-Leibler distance 

RKL( M ) = -£{ /flp(ro)e , oMff , 

(2.6) 



log J a p {w)e^ w UP™ 



The second term of (2.6) is directly computable from the estimate fj\. But 
the first term needs to be estimated. Let ip be the vector of spline basis 
functions as in the previous subsection and rj = tp T c. Through a delete-one 
cross-validation approximation, a proxy for (2.6) can be derived as 

1 V^J mn i f (\ vM^wl^HP^H^QPi) 



N f^y J J N(N-l) 

where P X =I - 11 T '/N, Q = (ip(W h ), . . .,i/>(Wi N )), and H is the hessian 
matrix for minimizing (2.1) with respect to the coefficient vector c. A is 
chosen to minimize this score. 



3. Numerical studies. In the simulations, we generated failure times 
from the exponential hazard model with h(t\U,W) = exp[C/ T /3 + ?7o(W^)]- 
We used the same settings for the parametric component, which consists of 
eight covariates Uj,j = 1, . . . , 8. The Uj's were generated from a multivari- 
ate normal distribution with zero mean and Cov(Uj,U k ) = 0.5^1. The true 
coefficient vector was (3 = (0.8, 0, 0, 1, 0, 0, 0.6, 0) T . 

The theory in Section 2.3 gives the sufficient order for q n , the number of 
knots in our smoothing spline estimation of r\. In practice, [16] suggested 
q n = kn 2 ^ 2m+1 ^ with k = 10 if the tensor product splines of order m are 
used. Since we use tensor product cubic splines in all the simulations below, 
our choice is q n = 10n 2 / 5 . 
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3.1. Variable selection for parametric components. The nonparametric 
part had one covariate W generated from Uniform(0, 1). Two different 770 
were used: 

ma (w) = 1. 5 sin (^irw-^j or r) 0b (w) = 4(w - 0.3) 2 + 4.7e~ w - 3.4643. 

Note that both functions satisfies /J t]q(w) dw = 0. Given U and W, the 
censoring times were generated from exponential distributions such that the 
censoring rates are, respectively, 23% and 40%. Sample sizes n = 150 and 
500 were considered. One thousand data replicates were generated for each 
of the four combinations of r/o and n. 

For a prediction procedure M and the estimator (0^ , r/x ) yielded from 
the procedure, an appropriate measure for the goodness-of-fit under Cox 
model with ho(t) = 1 is the model error: ME((3 M , 77^) = E[(exp(— U T /3 M — 
t)m(W)) -exp(-C/ T /3 - Vo{W))) 2 }. The relative model error (RME) of 
M.i versus M2 is defined as the ratio ME(/3 yV1i ,i7_A / f 1 )/ME(/3_ A/l2 ,?7x 2 ). The 
procedure Mo with complete oracle is used as our benchmark. In A^o, 
(E/i, C/4, U?, W) are known to be the only contributing covariates, the ex- 
act form of t]q is known, and the only parameters to be estimated are the 
coefficients of ?7i, C/4, U7. Note that Afo can be implemented only in simula- 
tions, but is unrealistic in practice since neither the contributing covariates 
nor the form of 770 would be known. We then compare the performance of 
the following four procedures, including the proposed procedures, through 
their RMEs versus Mq: 

Ma- procedure with partial oracle and misspecified parametric 770, that is, 
(Ux, C/4, U7, W) are known to be the only contributing covariates but 
770 is misspecified to be of the parametric form 770 (W) = f3wW and 
Pw is estimated together with the coefficients for (C7"i, C/4, C/7); 

Mb- procedure with partial oracle and estimated 770, that is, (Ux, U4, U7, W) 
are known to be the only contributing covariates but the form of 770 
is unknown, and 770 is estimated together with the coefficients for 
(Z7i, C/4, U7) by penalized profile partial likelihood; 

Mc'- the proposed partial linear procedure with the SCAD penalty on (3; 

Md'- the proposed partial linear procedure with the adaptive LASSO penalty 
on /3. 

Procedure M.a has a misspecified covariate effect. We intend to show that 
the estimation results can be unsatisfactory if the semiparametric form of 
covariate effect is mistakenly specified as parametric. Procedure Mb is "par- 
tial oracle" and expected to have equal or better performance than proce- 
dures Mc and A4d- Note, however, Mb is unrealistic in practice since the 
contributing covariates would not be known. Mc and M d are two versions 
of the proposed partial linear procedure with different penalties on (3. 
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For each combination of rjo and n, we computed the following quantities 
out of the 1000 data replicates: the median RMEs of the complete oracle 
procedure .Mo versus the procedures Ma to Md, the average number of 
correctly selected nonzero coefficients (CC), the average number of incor- 
rectly selected nonzero coefficients (IC), the proportion of under- fit repli- 
cates that excluded any nonzero coefficients, the proportion of correct-fit 
replicates that selected the exact subset model, and the proportion over-fit 
replicates that included all three significant variables and some noise vari- 
ables. The results are summarized in Table 1. In general, a partial oracle with 
misspecified parametric r/o (procedure Ma) has much inferior performance 
when comparing with the other three procedures; the proposed procedure 
with the SCAD penalty (procedure Mc) or the adaptive LASSO penalty 
(procedure M d ) is competitive to the partial oracle with estimated 770 (pro- 
cedure Mb)] the SCAD penalty performs slightly better than the adaptive 
LASSO penalty. Also, the proposed procedure generally performs as well 
as the complete oracle. For procedure Mc, we also did some extra compu- 
tation to evaluate the proposed standard error estimate of (3. In Table 2, 
SD is the median absolute deviation divided by 0.6745 of the 1000 nonzero 
/3's that can be regarded as the true standard error, SD m is the median of 
the 1000 estimated SDs, and SD ma d is the median absolute deviation of the 
1000 estimated SDs divided by 0.6745. The standard errors were set to 
for the coefficients estimated as 0s. The results in Table 2 suggests a good 
performance of the proposed standard error formula for j3. 

To examine the estimation of 770, we computed the point- wise estimates 
at the grid w = (0, l,by = 0.01) for each data replicate. Then at each grid 
point, the mean, the 0.025 and the 0.975 quantiles of the 1000 estimates, 
together with the mean of the 1000 95% confidence intervals were computed. 
The results are in Figure 1. The plots show satisfactory nonparametric fits 
and standard error estimates. 

3.2. Model selection for nonparametric components. In this section, we 
present some simulations to evaluate the model selection tool for nonpara- 
metric part introduced in Section 2.2. We used the SCAD penalty on the 
parametric components in this section. Two covariates W\ and W2, indepen- 
dently generated from Uniform(0, 1), were used. We considered two scenarios 
for the true model of the nonparametric part: (i) nonparametric univariate 
model t]q(W) = 7701 (Wi) and (ii) nonparametric bivariate additive model 
r]o(W) = rjoi (Wi) + 7702(^2)- For scenario (i), the data sets generated in the 
last section were used, with W\ being the existing W covariate and W2 being 
an additional noise covariate. The fitted model was nonparametric additive 
in W\ and W%. The ratios KL(?} , 77) / KL(t/ , rj c ) for the projections to the uni- 
variate models t]o(W) = 7701 (Wi) and t]q(W) = 7702(^2) were computed. For 
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Table 1 

Variable selection for parametric components (Section 3.1) 

No. of nonzeros Proportion of 



Procedure 


MRME 


CC 




IC 




Under-fit Correct-fit 


Over-fit 








n 


= 150, ryo 


= ?70a 


(23% censoring) 






\A . 


U.lOo 










_ 


_ 




KA 


U.4 ( 










- 


- 




A A 


n a no 
U.4Ua 


o nnc 
z.aao 




U.OZO 




0.002 


0.476 




Md 


0.387 


2.998 




0.959 




0.002 


0.444 


0.554 








n 


= 150, 770 


= 7?06 


(40% censoring) 






Ma 


0.167 
















Mb 


0.711 
















Mc 


0.518 


2.996 




0.949 




0.004 


0.430 


0.566 


Md 


0.563 


2.998 




1.131 




0.002 


0.378 


0.620 








ii 


= 500, 770 


= ?70a 


(23% censoring) 






Ma 


0.056 
















Mb 


0.431 
















Mc 


0.396 


3.000 




0.717 




0.000 


0.525 


0.475 


Md 


0.375 


3.000 




0.736 




0.000 


0.540 


0.460 








n 


= 500, 770 


= ?7<)f> 


(40% censoring) 






Ma 


0.057 
















Mb 


0.712 
















Mc 


0.619 


3.000 




0.749 




0.000 


0.512 


0.488 


Md 


0.628 


3.000 




0.776 




0.000 


0.529 


0.471 



scenario (ii), we considered two sample sizes n = 150 and 300. The true t]q 
was 



7?0(W1,W 2 ) = 0. 7770a (Wi ) +0.37?0ftO2) 



Table 2 

Standard deviations for f3's in the partial linear SCAD procedure Mc (Section 3.1) 
01 $4 $7 



n, censor % SD SD m (SD raad ) SD SD m (SD mad ) SD SD m (SD mad ) 

150, 23% 0.124 0.113 (0.015) 0.141 0.121 (0.017) 0.135 0.109 (0.015) 

150, 40% 0.159 0.135 (0.017) 0.188 0.145 (0.021) 0.155 0.128 (0.019) 

500, 23% 0.065 0.059 (0.005) 0.073 0.063 (0.005) 0.062 0.057 (0.005) 

500, 40% 0.075 0.070 (0.006) 0.088 0.076 (0.006) 0.078 0.066 (0.006) 
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n=150, 23% censoring n=150, 40% censoring 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 




w w 

Fig. 1. Estimates for nonparametric components (Section 3.1). Dotted lines are true 
function, solid lines are connected point-wise mean estimates, faded lines are connected 
0. 025 and 0. 975 quantiles of the point-wise estimates, and dashed lines are the connected 
point-wise 95% confidence intervals. 

or 

rjo(w 1 ,w 2 ) = r] a(wi) + V0b('W2), 

where r/oa and %6 are as defined in Section 3.1. The censoring times were 
generated from exponential distributions such that the resulting censoring 
rates were, respectively, 25% and 39%. Note that both choices of rjo are 
additive in w\ and W2- The fitted model was the nonparametric bivariate 
full model with both the main effects and the interaction. Then the ratios 
KL(fj,fj)/KL(fj,r] c ) for the projections to the bivariate additive model and 
the two univariate models were computed. In both scenarios, we claim a 
reduced model is feasible when the corresponding ratio KL(?}, ff) / KL(f/, rj c ) < 
0.05. 

For each of the eight combinations of 770 and n, we simulated 1000 data 
replicates and computed the proportions of replicates that produced the fol- 
lowing results in the reduced model: selected the main effect of W%, selected 
the main effect of W2, selected the interaction W\ : W2, under- fitted the 
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Table 3 

Model selection for nonparametric components (Section 3.2) 

Proportion of selecting Proportion of 
Sample 

size Wi W 2 Wi:W 2 Under-fit Correct-fit Over-fit 

True model: J7o(ioi,w 2 ) = r)o a (wi), 23% censoring 

n = 150 1.000 0.036 0.000 0.964 0.036 

n = 500 1.000 0.002 0.000 0.998 0.002 

True model: »jo(wi,W2) = Vob( w i), 40% censoring 

n = 150 1.000 0.304 0.000 0.696 0.304 

n = 500 1.000 0.062 0.000 0.938 0.062 

True model: 770(101,102) = 0.7rjo a (wi) + 0.3r?oi>(w2), 25% censoring 

n = 150 1.000 0.998 0.084 0.002 0.914 0.084 

n = 300 1.000 1.000 0.013 0.000 0.987 0.013 

True model: 770(101,102) = Voa(wi) + nob{w2), 39% censoring 

n = 150 1.000 0.672 0.201 0.328 0.471 0.201 

n = 300 1.000 0.616 0.096 0.384 0.520 0.096 



model by excluding at least one truly significant effect, correctly fitted the 
model by reducing to the exact subset model, and over-fitted the model by 
including all the truly significant effects and some irrelevant effects. These 
proportion results are summarized in Table 3. It shows that the variable 
selection tool for the nonparametric component works very well. The better 
performance appears to be associated with bigger sample sizes and lower 
censoring rates. 

4. Example. An example in [17] is a study on two sexually transmit- 
ted diseases: gonorrhea and chlamydia. The purpose of the study was to 
identify factors that are related to time until reinfection by gonorrhea or 
chlamydia given an initial infection of either disease. A sample of 877 in- 
dividuals with an initial diagnosis of gonorrhea or chlamydia were followed 
for reinfection. Recorded for each individual were follow-up time, indicator 
of reinfection, demographic variables including race (white or black, U±), 
marital status (divorced/separated, married or single, U2 and Us), age at 
initial infection (Wi), years of schooling (W2) and type of initial infection 
(gonorrhea, chlamydia or both, C/4 and C/5), behavior factors at the initial 
diagnosis including number of partners in the last 30 days (U§), indicators 
of oral sex within past 12 months and within past 30 days {U7 and Us), indi- 
cators of rectal sex within past 12 months and within past 30 days (Ug and 
U\o) and condom use (always, sometimes or never, U\\ and U12), symptom 
variables at time of initial infection including presence of abdominal pain 
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(C/13), sign of discharge (U14), sign of dysuria (U15), sign of itch (Uiq), sign 
of lesion (Un), sign of rash (Uis) and sign of lymph involvement (U\g) and 
symptom variables at time of examination including involvement vagina at 
exam (U20), discharge at exam (U21) and abnormal node at exam (L^)- 

We used q n = 10 • 877 2//5 = 151 knots in all the analysis below. We first 
considered the partial linear Cox model 

{3 22 ^| 

j=l k=l J 

where 973 (V^) = Tfo(Wu, W^i) is the interaction term between W\ and W2. 
However, the interaction term was found to be negligible with the ratio 
KL(i],fj)/KL(fi,r] c ) = 0.003. Hence, we took out this interaction term and 
refitted the model. In this model, neither W\ (age) nor W2 (years of school- 
ing) in the nonparametric component were found to be negligible, with 
the ratios KL(i],fj)/KL(fj,rj c ) equal to 0.633 for removing W\ and 0.259 
for removing W2- Their effects are plotted in Figure 2 together with the 
95% point-wise confidence interval. We can see that the hazard increased 
with age at both ends of the age domain (between age 13 and 20, and be- 
tween age 38 and 48) and stayed flat in the middle, and that the hazard 
decreased with years of school from 6 years to 10 years but stayed flat af- 
terwards. The fitted coefficients from the proposed method with the SCAD 
penalty are in Table 4 together with their standard error estimates. For 
comparisons, Table 4 also lists the fitted coefficients and standard errors 
for three other models, namely the proposed semiparametric relative risk 
model with the adaptive LASSO penalty, and the parametric relative risk 
models with the SCAD and the adaptive LASSO penalties. We can see 




15 20 25 30 35 40 45 



10 12 14 16 
years of schooling 



18 



Fig. 2. Nonparametric component estimates for sexually transmitted diseases data. Left: 
effect of age at initial infection. Right: effect of years of schooling. Solid lines are the 
estimates, dashed lines are 95% confidence intervals and dotted lines are the reference 
zero lines. 
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Table 4 

Fitted coefficients and their standard errors for sexually transmitted diseases data. 
(Models from top to bottom: semi-parametric relative risk with SCAD penalty and with 
adaptive LASSO penalty, parametric relative risk with SCAD penalty and with adaptive 

LASSO penalty) 



age 


yschool 


npart 


raceW 


maritalM 


maritalS 


- (-) 


- (-) 


(-) 


(-) 


(-) 


0.487 (0.212) 


~ (-) 


- (-) 


0.060 (0.048) 


-0.127 (0.097) 


(-) 


0.448 (0.186) 


(-) 


-0.059 (0.018) 


(-) 


(-) 


(-) 


0.332 (0.213) 


(-) 


-0.119 (0.031) 


0.026 (0.024) 


(-) 


o (-) 


0.210 (0.119) 


typeC 


typeB 


oralY 


oralM 


rectY 


rectM 


-0.412 (0.149) 


-0.337 (0.144) 


-0.336 (0.201) 


-0.341 (0.235) 


OH 


o(-) 


-0.349 (0.137) 


-0.300 (0.130) 


-0.330 (0.155) 


-0.318 (0.173) 


o(-) 


oh 


-0.376 (0.149) 


-0.249 (0.145) 


-0.236 (0.202) 


-0.348 (0.235) 


o(-) 


oh 


-0.228 (0.096) 


-0.083 (0.065) 


-0.110 (0.058) 


-0.371 (0.117) 


o(-) 


O(-) 


abdom 


disc 


dysu 


condS 


condN 


itch 


0.253 (0.151) 


o(-) 


0.193 (0.152) 


oh 


-0.327 (0.114) 


oh 


0.177 (0.120) 


o(-) 


0.089 (0.074) 


0.152 (0.114) 


-0.291 (0.106) 


o(-) 


0.285 (0.148) 


O(-) 


O(-) 


O(-) 


-0.296 (0.114) 


o(-) 


0.184 (0.094) 


o(-) 


O(-) 


O(-) 


-0.223 (0.092) 


o(-) 


lesion 


rash 


lymph 


involve 


discE 


node 


O(-) 


(-) 


oh 


0.423 (0.166) 


-0.460 (0.220) 


o(-) 


O(-) 


o(-) 


O(-) 


0.327 (0.159) 


-0.407 (0.209) 


o(-) 


O(-) 


o(-) 


O(-) 


0.392 (0.168) 


-0.443 (0.221) 


o(-) 


O(-) 


o(-) 


o(-) 


0.289 (0.133) 


-0.280 (0.163) 


o(-) 



that the SCAD penalty yielded sparser models than the adaptive LASSO 
penalty, and that both parametric models missed the age effect. Common 
factors identified by all the four procedures to be associated with reinfection 
risk are marital status, type of infection, oral sex behavior, condom use, 
sign of abdominal pain, sign of lymph involvement and sign of discharge at 
exam. 

5. Discussion. We have proposed a Cox PH model with semiparametric 
relative risk. The nonparametric part of the risk is estimated by smooth- 
ing spline ANOVA model and model selection procedure derived based on 
a Kullback-Leibler geometry. The parametric part of the risk is estimated 
by penalized profile partial likelihood and variable selection achieved by 
choosing a nonconcave penalty. Both theoretical and numerical studies show 
promising results for the proposed method. An important question in using 
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the method in practice is which covariate effects should be treated as para- 
metric. We suggest the following guideline for making choices. As a starting 
point, the effects of all the continuous covariates are put in the nonpar a- 
metric part and those of the discrete covariates in the parametric part. If 
the estimation results show that some of the continuous covariate effects 
can be described by certain parametric forms such as linear form, then a 
new model can be fitted with those continuous covariate effects moved to 
the parametric part. In this way, one can take full advantage of the flexible 
exploratory analysis provided by the proposed method. 

We thank a referee for raising the interesting question on the standard 
error estimates for the coefficients estimated to be in 0. References [25] 
and [7] suggested to set these standard errors to Os based on the belief that 
those covariates with zero coefficient estimates are not important. This is 
the approach adopted here. When such a belief is in doubt, nonzero standard 
errors are preferred even for coefficients estimated to be O's. This problem 
has been addressed only in a few papers. Reference [22] looked at the prob- 
lem for LASSO but it is based on a smooth approximation. Reference [24] 
presented a Bayesian approach and pointed out that no fully satisfactory 
frequentist solution had been proposed so far, no matter LASSO or SCAD 
variable selection procedure is considered. This problem presents an inter- 
esting challenge that we hope to address in some future work. 

Another choice of pg{\ ■ |) is the adaptive LASSO penalty [31]. Our simu- 
lations in Section 3.1 indicates a similar performance when compared to the 
SCAD penalty. So we decided not to present the details here. 

Although our method is presented for time-independent covariates, a 
lengthier argument modifying [23] can yield similar theoretical results for 
external time-dependent covariates [15]. However, the implementation of 
such extension is more complicated and not pursued here. 

A recently proposed nonparametric component selection procedure in a 
penalized likelihood framework is the COSSO method in [19] where the 
penalty switches from J{r}) to J 1 / 2 (7/). Taking advantage of the smoothing 
spline ANOVA decomposition, the COSSO method does model selection by 
applying a soft thresholding type operation to the function components. An 
extension of COSSO to the Cox proportional hazards model with nonpara- 
metric relative risk is available in [18]. Although a similar extension to our 
proportional hazards model with semiparametric relative risk is of interest, 
it is not clear whether the theoretical properties of the COSSO method such 
as the existence and the convergence rate of the COSSO estimator can be 
transferred to the estimation of r/ under our semiparametric setting. Fur- 
thermore, the dimension of the function space in COSSO is 0(n), too big to 
allow an entropy bound that is critical in deriving the asymptotic properties 
of fa. 
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APPENDIX: PROOFS 

For z = (u,w), let p z (t) = exp(u T '(3 + r]o(w))q(t,u,w)/s\j3,rjo\(t) and f t = 
J J f(u, w) Pz (t) du dw. Let S(t, u, w) = E[Y(t) = 1\U = u, W = w] = P(Y(t) = 
1\U = u,W = w) and q(t, u, w) = S(t, u, w)p(u, w), where p(u, w) is the den- 
sity function of (U, W). Let Z = U x W be the domain of the covariate 
Z = (U T ,W T ) T . We need the following conditions. 

Al. The true coefficient /3o is an interior point of a bounded subset of M. d . 

A2. The domain Z of covariate is a compact set in lR c^+ ' ^ . 

A3. Failure time T and censoring time C are conditionally independent 

given the covariate Z. 
A4. Assume the observations are in a finite time interval [0,r]. Assume 

that the baseline hazard function ho(t) is bounded away from zero and 

infinity. 

A5. Assume that there exist constants k-2 > k\ > such that k\ < q(t, u, w) < 
k 2 and \ -j^q(t,u,w)\ <k 2 - 

A6. Assume the true function tjq S T~L. For any r\ in a sufficiently big con- 
vex neighborhood Bq of 770, there exist constants c\,C2 > such that 
Cie mH < e r,{w) < C2e ??oW f or ^1 w _ 

A7. The smoothing parameter A >c n~ r ^ r+1 \ 

Condition Al requires that (3q is not on the boundary of the param- 
eter space. Condition A2 is also a common boundedness assumption on 
covariate. Condition A3 assumes noninformative censoring. Condition A4 
is the common boundedness assumption on the baseline hazard. Condition 
A5 bounds the joint density of (T, Z) and thus also the derivatives of the 
partial likelihood. Condition A6 assumes that 770 h as proper level of smooth- 
ness and integrates to zero. The neighborhood -Bo i n condition A6 should 
be big enough to contain all the estimates of rjo considered below. When the 
members of Bq are all uniformly bounded, condition A6 is automatically 
satisfied. The order for A in condition A7 matches that in standard smooth 
spline problems. 

We first show the equivalence between V(-) and the L2-norm || • |||. 

Lemma A.l. Let f G %. Then there exist constants < C3 < C4 < 00 
such that 

c&||/|i!<v(/)<d||/||2. 

Proof. For z = (u,w), let p z (t) = exp(u T {3 + rjo(w))q(t,u,w)/s[{3, rj ](t) 
and f t = JJ f(u,w)p z (t)dudw. Simple algebraic manipulation yields 

V{f) = I T {I /^( u ' u; )-^) 2 ^( t ) du(iu; } s [' 9 ' 7 ? ]W^Ao(t). 



VARIABLE SELECTION FOR SEMIPARAMETRIC COX MODELS 19 
By conditions A4 and A5, there exist positive constants c\ and C2 such that 



ci JJyJ J (f(u,w)-f t ) 2 dudwjdA (t) 

<V(f)<c 2 J U J (f(u,w)-f t ) 2 dudw 
Let m(Z) < oo be the Lebesgue measure of Z. Then 
(f(u,w) - ft) 2 dudw \dA (t) 



T 

= A (r) J j f 2 (u,w)dudw + m(Z) J J [f t ] 2 dA (t). 

The lemma follows from the Cauchy-Schwarz inequality and condition A4. 

□ 

Proof of Theorem 2.1. We will prove the results using an eigenvalue 
analysis of three steps. In the first step (linear approximation), we show the 
convergence rate O p (n- r A r+1 )) for the minimizer fj of a quadratic approxi- 
mation to (2.4). In the second step (approximation error), we show that the 
difference between fj and the estimate fj* in T~L is also O p (n~ r ^ r+1 ^), and so 
is the convergence rate of fj* . In the third step (semiparametric approxima- 
tion), we show that the projection if of fj* in % n is not so different from 
either fj* or the estimate f] in T~L n , and then the convergence rate of f] follows. 

A quadratic function B is said to be completely continuous with respect 
to another quadratic functional A, if for any e > 0, there exists a finite num- 
ber of linear functionals L\, . . . , such that Ljf = 0,j = 1, . . . , k, implies 
that B(f) < eA(f); When a quadratic functional B is completely continu- 
ous with respect to another quadratic functional A, there exists eigenfunc- 
tions {(j) u ,v = 1,2,...} such that i?(^,^) = 5 Ufl and A((j) u ,(j)^) = p v 8 UfJi , 
where 8 V a is the Kronecker delta and < p v \ oo. And functions satisfy- 
ing A(f) < oo can be expressed as a Fourier series expansion f = f u (j) u , 
where f v = B(f,<p u ) are the Fourier coefficients. See, for example, [9] and 
[29]. 

We first present two lemmas without proof. The first one follows directly 
from the results in Section 8.1 of [9] and Lemma A.l. The second one is 
exactly Lemma 8.1 in [9]. 

Lemma A. 2. V is completely continuous to J and the eigenvalues p v of 
J with respect to V satisfy that as v — )■ oo, p~ l = 0{u r ). 

Lemma A.3. As A -> 0, the sums (mX)* ' £„ (i+fc? ' and ^ T+fe: 
are all of order 0{\~ l l T ). 
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Step 1 (Linear approximation). A linear approximation fj to fj* is the 
minimizer of a quadratic approximation to (2.4), 

I n r 

— J2 / {riiW^-s-^PMttWlri-W^MWdNiit) 
n U J r 

(A.l) 

+ ^V(r ] - Vo ) + -J(i 1 ). 

Let r\ = ~Y^ V rj u (j) u and r/Q = f]ufi4>u be the Fourier expansions of 77 and tjq. 
Plugging them into (A.l) and dropping the terms not involving r\ yield 

(A.2) { -Vulu + ^ {Vu - r]»,o) 2 + ^Puvl 1 , 

where lv = ±T,?=i J T {MWi) - s-^M^n^PMWdNitt). The 
Fourier coefficients that minimize (A.2) are fj v = (■y u + ry^o)/(l + \p u )- Note 
that j 4> u {w) dw = and V((j) u ) = 1. Straightforward calculation gives E[y u ] = 
and E[yl] = n~ 1 . Then 

E[v{fj ~ m)] = (i+\ Pv y + A £ (i+t,) 2 ^' ' 

(A.3) 

£[A J« - %)] = ^E (1+ A ^ )2 + A E (i + Ap,)^ ^°- 

Combining Lemma A.3 and (A.3), we obtain that (V + XJ)(fj — r/o) = Op(A + 
n _1 A~ 1 / r ), as n — )• 00 and A — > 0. 

Step 2 (Approximation error). We now investigate the approximation 
error fj* — fj and prove the convergence rate of fj* . Define A^ g {a) and Bf g (a), 
respectively, as the resulting functionals from setting 7/ = / + ag in (2.4) and 
(A.l). Differentiating them with respect to a and then setting a = yields 

1 n r 

A f, 9 (°) = — Yl / MWO-a- 1 ^,/]^)^^;^/]^)}^^) 
n i=i • /r 

(A.4) 

+ AJ(/,<?), 

n 

B /, 9 (°) = --E /^(WO-S-MA^lWSnb^^oKt)}^^) 

(A.5) 



i=i 1 



+ V(f- m ,g) + XJ(f,g). 
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Set / = fj* and g = fj* — fj in (A. 4), and set / = fj and g = fj* — fj in (A. 5). 
Then subtracting the resulted equations gives 



(A.6) 



VV* (fj* -fj)- pfj(f)* -fj) + \J {fj* - fj) 

= V(fj - r] ,fj* -fj) + li m {fj* -fj)- Mt?(*T - fj), 
where p f (g) = ± YS=l It s n 1 IP, /](*)««[»! A /](*) ^W- Define 

c r, 1r .s = s n [fg;(3,r] ](t) s n [f;/3,rj ](t) s n [g; f3,rj ](t) 
n[h9lU s n [(3, V0 }(t) SnlPMW s n [(3, m ](t) 
and S[f,g](t) be its limit. The following lemma is needed to proceed. 

Lemma A. 4. 

i - r 

-E iSn[f,g](t)dN t (t) 



i=l 



T 



V(f,g) + o p ({(V + M)U)(V + A</)G?)} 1/2 ). 



Proof. Let / = Y^ u fv^u an d g = Yludv&n ^> e * ne Fourier series expan- 
sion of / and g. Reference [1] shows that sup t |s n [/3,r/o] — s[/3, 770] | converges 
to zero in probability. Note that 

M(t) = M(t\Z) = N(t) - fs[l3, t, ](t) dA (r) 

J 

defines a local martingale with mean zero. Combining the above uniform 
convergence result and the martingale property with the boundedness con- 
dition, we obtain that for any v and fi, 

2n 



E 



< 00. 



5[^,0 M ](t)div(t)-y(^,^) 

Then from the Cauchy-Schwarz inequality and Lemma A. 3, 

1 n r 

-J2 / S n [f,g](t)dN t (t)-V(f,g) 
n i=i ^ r 



< 



yy^ l - 



V f.1 
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2n, 1/2 



n 

-V / S„[6,,0 M ](t)dJVi(*)-^>' 
n £i^ 



1/2 



^^(l + A^)(l + Ap M )/^ 

= Orin-Wx-V'MV + XJ)(f)(V + XJ)(g)}^ 2 . □ 

A Taylor expansion at 770 gives 

1 n f 

i*r(v*-v)-m(v*-fi) = -J2 / S »W + 

n 1=1 ^ r 

1 n . 

fJ-fj(fj* ~ fj) ~ ^(V* ~ fj) = -^2 / S'n^-TJd, »?* diVi(t)(l + o p (l)). 

n i=l ^ T 

Then by the mean value theorem, condition A6, Lemma A. 4 and (A. 6), 

( Cl V + XJW-fj)(l + o p (l)) 

< {(|1 - c\V + A J) (77* - r?)} 1/2 P ({(|l - c|y + AJ)(r? - 770)} 1 / 2 ) 

for some c G [ci,c&]. Then the convergence rate of fj* follows from that of fj 
proved in the previous step. 

Step 3 (Semiparametric approximation). Our last goal is the conver- 
gence rate for the minimizer fj in the space T~L n . For any h € T~L T~L n , 
one has h(W k ) = J{Rj(W iv -),h) = 0, so s qn [h? ;/3,7/o](t) = 
^Ef=i^(i)^'W fe )ex P (C^/3 + V (W ik )) = for j = 1,2 and 

Xw=i Jr 1^' ^1 = Hence, by the same arguments used in the 

proof of Lemma A. 4, 



V(h) 

(A.7) 



-E / sumw^)-"^) 

?n l=1 Jr 



= O p (q-^ 2 X-^ r )(V + XJ)(h) = o p (XJ(h)), 

where the last equality follows from q n >c n 2 ^ r+l ^ +e and condition A7. 

Let 77* be the projection of fj* in H n . Setting f = fj* and g = fj* — 77* in 
(A. 4) and noting that J(rj*,fj* — 77*) = 0, some algebra yields 



(A. 



wf-^) = (^E/ w 

I i=l JT 



t?*)^)^) -7^(77* -77* 

{fJ, r (fj* - 7]*) - (J, m (fj* - 7/*)}. 



VARIABLE SELECTION FOR SEMIPARAMETRIC COX MODELS 23 

Recall that lv = ±YZ=i jf r M^i) dNi(t) - ,%,(<&,) with E hu] = and 
£>[7 2 ] = 1/n. An application of the Cauchy-Schwarz inequality 
and Lemma A. 3 shows that the first term in (A. 8) is of order {(V + XJ)(fj* — 
r/*)} 1 / 2 Op(?i _1 / 2 A~ 1 / 2r ). By the mean value theorem, condition A6, Lemma 
A. 4 and (A. 7), the remaining term in (A. 8) is of order o p ({\J(ff — i]*)(V + 
XJ)(fj* — t/o)} 1 ^ 2 )- These, combined with (A. 8) and the convergence rates of 
fj*, yield AJ(r)* -77*) = O p (rr 1 \- 1 / r + \) and V{fj* -rf) = o p {n^X~ l l r + X). 

Note that J(fj* - r?*,r?*) = J(fj* — r)*,f)) = 0, so J(fj*,fj* — ff) = J(fj* - 
rf) + J(rj*,rj* — fj). Set / = fj and g = r) — rf in (A. 4), and set f = fj* and 
g = fj* — fj in (A. 4). Adding the resulted equations yields 

m(V - V*) - Vvoiv - V*) + XJ(fj- rf) + XJ(fj* - if) 
= { \ E Jjfi* ~ dN ^ ~ ~ V*) | 

(A.9) 

- W vh -n ) -v vo (v -v )} 

+ {/^* (r} -??*)- /i w (77 - 77*)}. 

By the mean value theorem, condition A6, and Lemma A. 4, the left-hand 
side of (A.9) is bounded from below by 

(dV + XJ)(fj - 7?*)(1 + 0p (l)) + XJ{fj* - 7?*). 

For the right-hand side, the terms in the first and second brackets are, respec- 
tively, of the orders {(V + A </)(?)* — 7/*)} 1 / 2 O p (n~ 1 / 2 A~ 1 / 2r ) and o p ({XJ(fj* ~ 
ff)iy + XJ)(fj* — 770)} 1 / 2 ) by similar arguments for (A. 8), and the terms in 
the third bracket is of the order 

{(V + XJ)(fj - 7 / *)} 1 / 2 0p ({AJ(7}* - r,*)} 1 / 2 ) 

by condition 3, Lemma A. 4 and (A. 7). Putting all these together, one ob- 
tains (V + XJ)(fj - 77*) = O p (n- 1 A- 1 / r + A) and hence (V + A J) (77 - r/ ) = 
O p (n _1 A _1 / r + A). And an application of condition A7 yields the final con- 
vergence rates. □ 

Proof for the asymptotic properties of /3. Let P n be the em- 
pirical measure of (ATj,Aj = l,Zi),i = l,...,n such that it is related to 
the empirical measure Q n of (Xj, Aj, Zj),z = l,...,n by P n f = f f dP n = 
jAfdQ n = re" 1 Y17=i ^if(Ti, Aj, Zi). Let P be its corresponding 
(sub)probability measure. Let L2OP) = {f '■ f f 2 dP < 00} and || • H2 be the 
usual L2-norm. For any subclass J- of L2(P) and any e > 0, let jVr.i (e, J 7 , Li2(P)) 

be the bracketing number and Jj.j (5, J 7 , L2 (P)) = Jq Jl + logA/j.](e, J 7 , Z^(P)) cfe. 
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Lemma A. 5. Let mo(t, u, w; (3, rj) = u 1 (3 + rj(w) — log s[/3, rj\(t), m\{t, u, 
w; s,(3,rj) = l [s < t] exp(/u T /3 + rj(w)), and m 2 {t,u,w; s, (3,r), f) = l [s < t] f(u, 
w) exp(u T /3 + rj(w)). Define the classes of functions 

M (5) = {m : ||/3 - f3 \\ < 5, \\ V - noh < ^ 

Mi(S) = { mi :s£T, ||/3 - (3 \\ < 5, \\ v - voh < S}, 

M 2 (8) = {m 2 : S eT,\\(3- /3 || < 5, \\ v - m \\ 2 < 6, \\h\\ 2 < 6}. 

Then J [ . ] (S,M ,L 2 (P)) < c Q q]J 2 5 and J^S, Mj , L 2 (P)) = Cj 5{q n + log(l/ 

m 1/2 J = 1,2. 

Proof. The proof is similar to that of Corollary A.l in [10] and thus 
omitted here. □ 

Lemma A. 6. 

sup Is- 1 [/3 , rj] (t)s[U; f3 , rj] (t) - s" 1 [(3 ,f)} (t)s n [U ;(3 ,r)} (t) \ = o^n" 1 / 2 ) . 
teT 

Proof. Write 

s ~ 1 [/3 ,rj] (t ) s [U ; /3 ,fj] (t ) - s~ 1 [/3 ,rj] (t ) s n [U; (3 , rj] (t ) 
= s[(3 ,f)](t)A ln (t) - s[U;/3 ,fi}(t)A 2n (t) 

where A ln = s n [U; j3 ,rj](t) - s[U;(3 ,fj](t) and A 2n = s n \f3 ,fj](t) - s[/3 , 
fj](t). Note that g n = o(n 1//2 ), hence the result follows from Lemma 3.4.2 
of [27] and Lemma A. 5. □ 

PROOF of Theorem 2.2. Let ^ n = n~ 1 / 2 . To prove 2.2(i), we need to 
show that V<5 > 0, there exists a large constant C such that 

p{ sup £ P (/3 + ln v) < C P (J3 )\ > 1 - S. 
IMI=c J 

Consider Cp((3 + ^/ n v) — £p(/9 ). We can decompose it to the sum of D n \ = 
l p (/3 + j n v) — I p (Pq) and the penalty difference D n2 . As shown in [7], under 
the assumption of a n = 0{n~ 1 / 2 ) and b n = o(l), n~~ 1 D n2 is bounded by 

(A.10) ^lna n \\v\\ +llb n \\v\\ 2 = C7 2 ( v / s + 6 n C), 

where s is the number of nonzero elements in /3 . 

Applying the second order Taylor expansion to n~ l D n \, gives 

(A.ll) n~ x B n x = 7 n v T Ji„ - \"ily T 3 2n v + o p (n _1 ) 
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with J ln = P n {U - s n 1 [/3 , fj] (t)s n [U;/3 , fj] (t)}, and J 2n = P n {s n 2 [f3 , fj] (t) x 
[s n [UU T ;(3 ,fi](t) Sn [(3 ,fi}(t) - Sn [U;/3 ,fi](t)s n [U;P ,fi}(t) T }}, where U(u, 
w) = u. 

Let U n = Y!l=i Ui/n. Note that S - l [(3 Q ,f)}(t)s n [U n ;(3 Q ,<q\{t) = U n . We have 

Jln = Pn{U ~U n - S~ l [/3„ ,fj] (t)s n [U - U n ; O , fj] (t)} 

(A.12) 

= hn + hn + ^3n ; 

where 

hn = (Pn ~ P){U ~U n - s" 1 ^, f)](t)s[U - U n ; (3 , fj](t)}, 

hn = P n {s-\P Q mW\M{t) ~ S-\t3 Q ,fj\{t)s n [U-,M(t)}, 

hn = P{U -U n - s-^PoMMU - U n ;(3 ,fj](t)}- 

Lemma 3.4.2 of [27] and Lemma A. 5 indicate that (P n — P){s~ 1 [(3q, fj](t)s[U — 
U n ; f3 ,fj](t)} = o p (n~ 1 / 2 ), where the fact that q n = o p (n~ l l 2 ) is used again. 
Also (P n - P){U - U n ] = Opin- 1 / 2 ) by the LLN. Hence, we have hn = 
O p (n -1 / 2 ). Lemma A. 6 gives hn — o p (n~ 1 / 2 ). Finally, by the boundedness 
assumption I 3n = P^" 1 ^,,. 3] - U;j3 , fj](t)} = O p (E\U n - U\) = 

O p (n~ 1 / 2 ). Hence, J ln = O p (n _1 / 2 ). Also, J 2n converges to V(U) > 0. Thus, 
when C is sufficiently large, the second term in (A.ll) dominates both terms 
in (A.10). Theorem 2.2(i) follows. 

Next, we shall show the sparsity of 0. It suffices to show that for any given 
Pi satisfying \\Pi~ /3 10 || = O p (n~ 1 / 2 ) and any j = s + 1, . . . , d, d£p(P)/df3j > 
for < pj < Cn~ 1 / 2 and dC P (p)/d(3 j < for -Cn~ 1 / 2 < fa < 0. For fa / 
and j = s + 1, . . . , d, 

n~ l dC P (p)/d^=P n {U j -s~\pMt)sn[Uf,PMt)}-p'e j m)^m 

Similar to bounding Ji n , the first term can be shown to be O p (n~ 1 ^ 2 ). Recall 
that 6J 1 = o(n 1 / 2 ) and limnnv^oo liminf u _ i> o+ Q^POj ( M ) > Hence, the sign 
of dCp(P)/df3j is completely determined by that of /3j. Then fi 2 = 0. 

Lastly, we show the asymptotic normality of Pi using the result in [21]. 
Let Zi = (Xi, Aj, Zj). Note that Pi is the solution of the estimating equation 

n 

(A.13) ^MfaP^ft-nC^O, 

i=l 

where M(z,P 1 ,rf) = f{Ui - s~ l [Pi, 7]](t)s n [Ui; Pi, rj\(t)} dN(t) and Ci = 
(p^dADsgn^O^.^^d^Dsgn^f.Let 

D(zh)= [I ^Ilh^mll^M s n[Ui;Pio,Vo](t) Sn[h;Pi ,rjo](t) \ . 
{,) J \s n [Uih;P 10 ,r) ](t)~ s n [Pi ,Vo](t) s n [P 10 , Vo ](t) J 
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be the Frechet derivative of M(z,/3 10 ,r/) at r/o- Since the convergence rate of 
fj is n~ r /[ 2 ( r+1 ^ =o(n -1 / 4 ), the linearization assumption (Assumption 5.1) 
in [21] is satisfied. A derivation similar to bounding (A. 12) can verify the 
stochastic assumption (Assumption 5.2) in [21]. Direct calculation yields 
E[D(z, rj — 7/0 )] =0 for rj close to r]Q. Then the mean-square continuity as- 
sumption (Assumption 5.3) in [21] also holds with a(z) = 0. By Lemma 5.1 
in [21], (3 l thus has the same distribution as the solution to the equation 

n 

^M(z 4 ,/3 1 ,7/ )-nC 1 = 0. 

8=1 

A straightforward simplification yields the result. □ 
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